fast-interp: relaxed-SIMD opcode lowering #4950

Open
matthargett wants to merge 9 commits into bytecodealliance:main from rebeckerspecialties:feat/relaxed-simd-fast-interp

@matthargett matthargett commented May 21, 2026

Implements the 20 relaxed-SIMD sub-opcodes (0x100..0x113) in the fast-interp
HANDLE_OP(WASM_OP_SIMD_PREFIX) switch and adds a WAMR_BUILD_RELAXED_SIMD cmake
flag (default off, opt-in). Currently those sub-opcodes hit the
"unsupported SIMD opcode" arm at wasm_interp_fast.c:7474. The three ops
SIMDe doesn't ship (relaxed_q15mulr_s + the two relaxed_dot_i8x16_i7x16*
variants) get hand-built implementations; the rest route through simde/wasm/relaxed-simd.h.

Why we built this: we're replacing WasmEdge with WAMR fast-interp as the wasm
runtime in a pure-interpreter, App-Store-eligible app, and the audio DSP
path (a modified version of xmrsplayer) uses f32x4.relaxed_madd to reach the
interpreter-only performance that we need. Without this, fast-interp fails at
load on every simd128 workload we have. We introduced those workloads to
reduce opcode dispatch pressure/overhead in interpreters.

Test coverage — three layers, 174 conformance checks:

  • WAMR unit tests in tests/unit/relaxed-simd/ (load + invoke + boundary regressions).
  • 32 hand-rolled abuse cases + 76 differential comparisons against wasmtime's
    Config::relaxed_simd_deterministic(true) mode in our benchmark repo. The
    diff-fuzz layer caught a spec-violating impl of
    i32x4.relaxed_dot_i8x16_i7x16_add_s before submission: a missing i16
    truncation that produced lane values outside the spec-allowed set. The
    upstream spec testsuite did not catch it because every existing assertion
    stays within the i16 pair-sum range. Fix is in this PR; corresponding
    spec-test addition at WebAssembly/relaxed-simd#164.
  • 69 upstream WebAssembly/relaxed-simd spec-testsuite assertions wired up
    through fast-interp with (either …) membership semantics.

Cross-microarch benchmarks (M4 Lion P / Sawtooth E / A14 Icestorm / A12 Tempest /
S8 Watch SE2) at
https://github.com/rebeckerspecialties/wasm-benchmark/blob/claude/relaxed-simd-diff-fuzz/README.md#cross-runtime-results-across-apple-silicon-e-cores .
ASan, UBSan, and fuzzing are part of my local dev loop for finding corner cases.

Companion PR: legacy exception support #4949

…-SIMD

The relaxed-SIMD proposal — finalized as a wasm 2.0 extension — uses
the same 0xfd SIMD prefix and reserves sub-opcodes `0x100..0x113`
for its 20 new ops:

  relaxed_swizzle, relaxed_trunc_{f32x4,f64x2}_{s,u},
  relaxed_madd / relaxed_nmadd for f32x4 + f64x2,
  relaxed_laneselect for i8 / i16 / i32 / i64,
  relaxed_min / relaxed_max for f32x4 + f64x2,
  relaxed_q15mulr_s,
  relaxed_dot_i8x16_i7x16_{s,add_s}.

This commit lays the loader-side validation needed to *recognize*
these opcodes without changing dispatch / runtime behaviour:

  * `WASMSimdEXTOpcode` enum (wasm_opcode.h) extended with the 20
    new constants at the spec-assigned values 0x100..0x113. Gated
    behind `WASM_ENABLE_RELAXED_SIMD != 0` so a build without the
    cmake flag (added in a follow-up commit) sees no new symbols
    and the enum's storage is unchanged.

  * `wasm_loader_find_block_addr` SIMD-prefix immediate skipper
    (wasm_loader.c:8273-8363) — the inner switch is now on the
    raw LEB-uint32 sub-opcode instead of the `(uint8)` cast, so
    relaxed-SIMD sub-opcodes 0x100..0x113 reach their own case
    labels instead of aliasing into legacy slots 0x00..0x13 and
    triggering wrong `skip_leb_*` paths. Relaxed-SIMD opcodes
    carry no immediates beyond the prefix, so the new cases just
    `break`. They are listed explicitly so a future SIMD-spec
    assignment in 0x100..0x113 doesn't fall through the default
    branch and silently mis-skip an immediate. The cast assignment
    to the outer `opcode` variable is removed since the inner
    switch no longer uses it (the outer-function switch already
    matched `WASM_OP_SIMD_PREFIX` and we are inside that case).

  * `wasm_loader_prepare_bytecode` SIMD-prefix type checker
    (wasm_loader.c:16186+) — extended with type-signature case
    labels for each relaxed-SIMD opcode. Three signature classes:

      unary  (1 v128 -> 1 v128): the four trunc variants.
      binary (2 v128 -> 1 v128): swizzle, min/max, q15mulr,
                                 dot_i8x16_i7x16_s.
      ternary(3 v128 -> 1 v128): madd, nmadd, laneselect,
                                 dot_i8x16_i7x16_add_s.

    The 3-input ternary shape uses `POP_V128()` + `POP2_AND_PUSH`,
    mirroring how `SIMD_v128_bitselect` handles its 3-input shape
    today — no new stack-tracker macro needed.

  * The trailing `default:` branch in the type checker keeps
    rejecting unrecognized SIMD sub-opcodes with
    `"invalid opcode 0xfd %02x."`, which now correctly surfaces
    the full uint32 value (relaxed-SIMD opcodes 0x100+ are
    rendered as e.g. `0xfd 100` — the `%02x` width is a minimum,
    not a truncation).
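
The aliasing hazard and the `%02x` rendering described above can be reproduced in isolation. This is a minimal standalone sketch, not WAMR code: `alias_to_u8` and `format_simd_opcode` are hypothetical helper names introduced only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not a WAMR function): what a `(uint8)` cast does to
 * a LEB-decoded SIMD sub-opcode. Only the low 8 bits survive. */
static uint8_t
alias_to_u8(uint32_t sub_opcode)
{
    return (uint8_t)sub_opcode;
}

/* Hypothetical helper (not a WAMR function): formats a sub-opcode with the
 * same format string the loader's error message uses. `%02x` sets a minimum
 * width of two digits, so wider values are printed in full. */
static int
format_simd_opcode(char *buf, size_t size, uint32_t sub_opcode)
{
    return snprintf(buf, size, "invalid opcode 0xfd %02x.",
                    (unsigned)sub_opcode);
}
```

With the cast, 0x100 collapses to 0x00 and 0x113 to 0x13, which is exactly the overlap with the legacy slots. Switching on the raw uint32 avoids it.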

The runtime executor (the actual case bodies in
`HANDLE_OP(WASM_OP_SIMD_PREFIX)` and the IR encoder widening
needed to reach them past the existing 1-byte sub-opcode read)
is the follow-up commit. Cmake `WAMR_BUILD_RELAXED_SIMD` flag
that flips `WASM_ENABLE_RELAXED_SIMD=1` is the third commit.
Built clean against `cd390ea0` with the flag absent — no
binary or behavioural change to existing SIMD code.

References:
  https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md
  https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/_md/instructions.md

The 20 relaxed-SIMD ops (`0x100..0x113`) land as new case bodies
inside the existing `HANDLE_OP(WASM_OP_SIMD_PREFIX)` switch in
`wasm_interp_fast.c`. Each case follows the legacy SIMD-case
shape: pop the v128 operand(s) from `frame_lp`, hand them to a
SIMDe intrinsic (or a hand-written per-lane loop for the three
SIMDe-missing ops), and write one v128 result.

To reach a case past 0xff the SIMD sub-opcode is widened from a
single byte to a little-endian uint16 in the IR. The loader emits
two consecutive bytes via `wasm_loader_emit_int16` (STORE_U16, no
padding even on platforms without unaligned access). The runtime
reads `frame_ip[0] | (frame_ip[1] << 8)` and switches over the
full `0x000..0x113` range. The widening is conditional on
`WASM_ENABLE_RELAXED_SIMD != 0`; when off, the IR is still
1-byte-per-SIMD-op via `emit_byte` and the runtime dispatch is
the legacy `GET_OPCODE()` 1-byte read — byte-identical to
upstream.
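
As an illustration of the widened encoding, here is a minimal sketch of a two-byte little-endian emit and the matching byte-wise read. The function names are hypothetical. The real loader goes through `wasm_loader_emit_int16`, whose internals are not reproduced here; the sketch uses plain byte stores as a simplifying assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the loader side: write the sub-opcode as two
 * bytes, low byte first, and return the advanced write pointer. */
static uint8_t *
emit_sub_opcode_u16(uint8_t *ip, uint32_t sub_opcode)
{
    ip[0] = (uint8_t)(sub_opcode & 0xff);
    ip[1] = (uint8_t)((sub_opcode >> 8) & 0xff);
    return ip + 2;
}

/* Runtime side: the read expression quoted in the commit message. It reads
 * one byte at a time, so it has no alignment requirement. */
static uint32_t
read_sub_opcode_u16(const uint8_t *frame_ip)
{
    return (uint32_t)frame_ip[0] | ((uint32_t)frame_ip[1] << 8);
}
```

A round trip of 0x113 stores the bytes 0x13, 0x01 and reads back 0x113, so the dispatch switch sees the full sub-opcode.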

Per-case dispatch:

  i8x16.relaxed_swizzle                         (binary)      DOUBLE
  i32x4.relaxed_trunc_f32x4_{s,u}               (2 unary)     SINGLE
  i32x4.relaxed_trunc_f64x2_{s,u}_zero          (2 unary)     SINGLE
  {f32x4,f64x2}.relaxed_{madd,nmadd}            (4 ternary)   TRIPLE
  {i8x16,i16x8,i32x4,i64x2}.relaxed_laneselect  (4 ternary)   TRIPLE
  {f32x4,f64x2}.relaxed_{min,max}               (4 binary)    DOUBLE
  i16x8.relaxed_q15mulr_s                       (binary)      hand loop
  i16x8.relaxed_dot_i8x16_i7x16_s               (binary)      hand loop
  i32x4.relaxed_dot_i8x16_i7x16_add_s           (ternary)     hand loop

SIMDe's `simde/wasm/relaxed-simd.h` (already shipped in
`core/deps/simde`) provides 17 of the 20 intrinsics; q15mulr_s,
dot_i8x16_i7x16_s, and dot_i8x16_i7x16_add_s are missing so the
dispatch loop inlines a per-lane C implementation. The relaxed-
SIMD spec allows implementation-defined behavior on overflow for
those three (wrap vs. saturate); the impls here match the
strict-IEEE / saturating shape — same as the corresponding
non-relaxed ops — which is conformant and matches the SIMDe
hand-coded fallbacks for q15mulr_sat_s.

A new local `SIMD_TRIPLE_OP(simde_func)` macro pops 3 v128s and
hands them to a 3-arg intrinsic; same shape as `SIMD_DOUBLE_OP` /
`SIMD_SINGLE_OP` for two- and one-arg ops. `#undef`-ed at the end
of the gated block so the macro doesn't leak into the legacy
build.
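
For readers unfamiliar with the pop-three, push-one shape, the following is a simplified sketch and not the macro from this PR. It assumes a plain operand stack addressed by a pointer `sp` (the real interpreter addresses operands through `frame_lp` offsets), and `v128_sketch`, `laneselect_sketch`, and `TRIPLE_OP_SKETCH` are hypothetical names. The example op is the bitwise a/b mux, one of the behaviours laneselect is allowed to have.

```c
#include <stdint.h>

/* Hypothetical 128-bit value, split into two 64-bit halves. */
typedef struct {
    uint64_t lo, hi;
} v128_sketch;

/* Bitwise mux: take bits of `a` where the mask bit is 1 and bits of `b`
 * where it is 0. */
static v128_sketch
laneselect_sketch(v128_sketch a, v128_sketch b, v128_sketch mask)
{
    v128_sketch r;
    r.lo = (a.lo & mask.lo) | (b.lo & ~mask.lo);
    r.hi = (a.hi & mask.hi) | (b.hi & ~mask.hi);
    return r;
}

/* Hypothetical 3-operand dispatch helper: pop three operands (the last one
 * pushed becomes the third argument), call `func`, push one result.
 * `sp` points one past the top of the operand stack. */
#define TRIPLE_OP_SKETCH(sp, func)       \
    do {                                 \
        v128_sketch v3_ = *--(sp);       \
        v128_sketch v2_ = *--(sp);       \
        v128_sketch v1_ = *--(sp);       \
        *(sp)++ = func(v1_, v2_, v3_);   \
    } while (0)
```

Pushing a, b, mask and then running `TRIPLE_OP_SKETCH(sp, laneselect_sketch)` leaves exactly one value on the stack, which is the property the one- and two-operand macros share.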

Smoke tested via a 6-op WAT module (swizzle, madd, min,
laneselect, q15mulr_s, trunc_f32x4_s) compiled to wasm and run
through the `iwasm` CLI with `WAMR_BUILD_RELAXED_SIMD=1`:

  madd        = [110, 240, 390, 560]           ✓
  trunc_f32   = [1, -2, 3, -4]                 ✓
  min         = [1, 2, 2, 1]                   ✓
  q15mulr     = [0,0,1,1,3,4,6,-7]             ✓
  swizzle     = [15..0] (reverse)              ✓
  laneselect  = (bitwise a/b mux per mask)     ✓

The `wasm_loader_prepare_bytecode` SIMD switch type checker
(commit 1) is already populated for the new opcodes, so the
relaxed-SIMD wasm validates through the loader and then reaches
the new dispatch cases here. The cmake flag that exposes the
feature (`WAMR_BUILD_RELAXED_SIMD`) is the next commit; this one
adds the runtime side gated on the eventual macro.

Lights up the dormant `WASM_FEATURE_RELAXED_SIMD` bit at
`aot_runtime.h:32` for the fast interpreter. Default `0` so a
build that doesn't explicitly opt in stays byte-identical to
upstream — the loader + dispatch added in the two prior commits
all sit behind `#if WASM_ENABLE_RELAXED_SIMD != 0`.

  * `WAMR_BUILD_RELAXED_SIMD=1` adds `-DWASM_ENABLE_RELAXED_SIMD=1`
    to the C compile line and prints `"Relaxed SIMD enabled"` next
    to the existing `"SIMD enabled"` line.

  * `WAMR_BUILD_RELAXED_SIMD=1 WAMR_BUILD_SIMD=0` fails fast with
    `FATAL_ERROR "WAMR_BUILD_RELAXED_SIMD=1 requires
    WAMR_BUILD_SIMD=1"`. Relaxed-SIMD is a superset of the base
    feature — the dispatch loop, frame_lp v128 cells, and SIMDe
    intrinsics it shares with legacy SIMD would all be compiled
    out otherwise.

  * Listed in the existing "feature summary" block alongside
    `"Fixed-width SIMD"` so `WAMR_INFO` output makes the new
    knob visible.

Verified locally on macOS-15 / aarch64:

  flag=0 (default):
    iwasm -f madd /tmp/relaxed_smoke.wasm
    -> WASM module load failed: invalid opcode 0xfd 100.

  flag=1:
    iwasm -f madd /tmp/relaxed_smoke.wasm
    -> <0x4370000042dc0000 0x440c000043c30000>:v128
       (correct f32x4 result for relaxed_madd)

  flag=1 simd=0:
    cmake -> "WAMR_BUILD_RELAXED_SIMD=1 requires WAMR_BUILD_SIMD=1"
    (configure aborts)
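
To check the `flag=1` output by hand, the two 64-bit halves can be split into f32 lanes. This is a small sketch with a hypothetical helper name, assuming the lower-numbered lane sits in the low 32 bits of each printed half (which matches the smoke-test values `[110, 240, 390, 560]`).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: split one 64-bit half of a printed v128 into two
 * f32 lanes. out[0] comes from the low 32 bits, out[1] from the high 32. */
static void
decode_f32_pair(uint64_t half, float out[2])
{
    uint32_t lo = (uint32_t)(half & 0xffffffffu);
    uint32_t hi = (uint32_t)(half >> 32);
    /* memcpy is the portable way to reinterpret the bits as float. */
    memcpy(&out[0], &lo, sizeof(float));
    memcpy(&out[1], &hi, sizeof(float));
}
```

Decoding `0x4370000042dc0000` gives 110.0 and 240.0, and `0x440c000043c30000` gives 390.0 and 560.0.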

The two macros `SIMD_V128_TO_SIMDE_V128` and `SIMDE_V128_TO_SIMD_V128`
copy 16-byte values between WAMR's `V128` union-of-arrays and
SIMDe's compiler-intrinsic vector type (`int32x4_t` on aarch64,
`__m128i` on x86-64) at every SIMD case boundary. The previous
shape used `bh_memcpy_s`, which lives out-of-line in
`core/shared/utils/bh_common.c`. Without LTO the call doesn't
inline, so every conversion compiled into a real `bl` instruction:
three function calls on 3-operand SIMD ops (madd / nmadd /
laneselect / bitselect / dot_add) plus one on the store, for ~4
function calls per SIMD dispatch.

xctrace CPU Counters on the aarch64 M4 E-core, matmul-fma
workload (the relaxed-SIMD f32x4_relaxed_madd hot loop):

                before     after
  Useful        78.1%      71.4%
  Processing     6.1%      23.3%
  Delivery      13.4%       2.9%   <- frontend stalls, the bottleneck
  Discarded      2.4%       2.5%
  total cycles  301M       733M    (over 5s vs 10.9s, more iters)

The 13.4% `Delivery` share (frontend / L1-I stalls) fell to 2.9%:
the SIMD-prefix region's case bodies were big enough (~50
instructions per relaxed_madd dispatch, dominated by `bl
memcpy_chk` chains and intermediate v128 spills) to push the
SIMD switch out of L1-I on the E-core. After the fix each case
body is ~15 instructions, all register-resident, no calls.

Per-case disassembly (`f32x4_relaxed_madd`):

  before                                after
  ~50 instructions                      ~15 instructions
  3x bl memcpy_chk                      0 calls
  4x v128 stack-spill load/store        3 frame_lp loads,
                                        1 frame_lp store,
                                        1 fmla.4s

`wasm_interp_call_func_bytecode` total instruction count drops
from 14,560 -> 8,735 (40% smaller, comfortably inside the
Icestorm 128 KiB L1-I budget alongside hot non-SIMD ops).

End-to-end wallclock on M4 E-core (`cargo run --release --bin
bench_relaxed_simd`):

  matmul simd128 (mul+add)
    WAMR before: 1.490 ms median
    WAMR after:  0.468 ms median   (3.2x speedup)
    Pulley:      1.217 ms median
  matmul relaxed-simd (FMA)
    WAMR before: 1.180 ms median
    WAMR after:  0.369 ms median   (3.2x speedup)
    Pulley:      0.921 ms median

WAMR now leads Pulley on both shapes (2.60x faster on
matmul-simd128, 2.50x faster on matmul-fma), and WasmEdge
interp by 6-7x. The fix applies to *all* SIMD ops, not just
the relaxed-SIMD ones: the macros are on the hot path for
every f32x4 / i32x4 / v128.load / v128.store in the fast
interpreter.

Correctness: `_Static_assert` upgrades the `bh_assert`
size-equality guard from runtime to compile-time so a future
divergence between V128 and simde_v128_t trips the build
rather than silently miscompiling. Semantically identical to
the pre-fix `bh_memcpy_s` for these fixed-size copies.
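
The general shape of an inlinable fixed-size conversion with a compile-time size guard looks roughly like this. It is a sketch under assumptions, not the macros from this commit: `V128Sketch` and `VecSketch` are hypothetical stand-ins for WAMR's `V128` and SIMDe's `simde_v128_t`, and the sketch uses `static inline` functions rather than statement-expression macros.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a union-of-arrays 128-bit value. */
typedef union {
    uint8_t i8x16[16];
    uint32_t i32x4[4];
    uint64_t i64x2[2];
} V128Sketch;

/* Hypothetical stand-in for a compiler vector type. */
typedef struct {
    uint32_t lanes[4];
} VecSketch;

/* Compile-time guard: a size mismatch breaks the build instead of
 * silently copying the wrong number of bytes. */
_Static_assert(sizeof(V128Sketch) == sizeof(VecSketch),
               "V128 and vector type must have the same size");

static inline VecSketch
v128_to_vec(V128Sketch v)
{
    VecSketch out;
    /* Constant-size memcpy visible in the same translation unit, which
     * compilers can typically lower to plain loads and stores. */
    memcpy(&out, &v, sizeof(out));
    return out;
}

static inline V128Sketch
vec_to_v128(VecSketch v)
{
    V128Sketch out;
    memcpy(&out, &v, sizeof(out));
    return out;
}
```

The point of the design is that the copy size is a compile-time constant and the helper body is visible at the call site, so no out-of-line call is needed.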
…ts/unit

Anticipates and addresses common WAMR maintainer review feedback on
the relaxed-SIMD PR (#3):

  * **HIGH — silent AOT mis-compile when RELAXED_SIMD=1 AOT=1**:
    the shared loader `prepare_bytecode` (`wasm_loader.c`) is
    reached by AOT/JIT/wamrc too. With this PR's commit 1 it
    accepts the new sub-opcodes 0x100..0x113, but the AOT path
    in `core/iwasm/compilation/aot_compiler.c:1494,2463,2639,2799`
    does `opcode = (uint8)opcode1`, silently aliasing
    `relaxed_swizzle` (0x100) into `SIMD_v128_load` (0x00) and
    reading a garbage memarg at codegen time.
    Reject the combination at cmake-configure time:
    `WAMR_BUILD_RELAXED_SIMD=1` now requires
    `WAMR_BUILD_FAST_INTERP=1` and explicitly rejects
    `WAMR_BUILD_AOT=1 / WAMR_BUILD_JIT=1 / WAMR_BUILD_FAST_JIT=1 /
     WAMR_BUILD_WAMR_COMPILER=1` with a diagnostic that points
    at `aot_compiler.c` and says "build fast-interp-only to use
    relaxed-SIMD until the AOT/JIT pipelines learn the wider
    sub-opcode range."

  * **`core/config.h` default for `WASM_ENABLE_RELAXED_SIMD`**:
    `#ifndef … #define … 0 #endif` block alongside `WASM_ENABLE_SIMD`
    and `WASM_ENABLE_SIMDE`. Cosmetic but matches WAMR's pattern
    for every other feature flag — non-cmake builds (e.g. CI lint
    that compiles a TU in isolation) still see a defined value.

  * **`tests/unit/relaxed-simd/`**: gtest-based unit test that
    loads + invokes a hand-encoded wasm module with
    `f32x4.relaxed_madd`. Two tests:
      - `load_module_with_relaxed_madd`: asserts the loader
        accepts the module (pre-PR, this fails with
        `"invalid opcode 0xfd 100"`).
      - `invoke_relaxed_madd_returns_fma_result`: invokes the
        export, asserts the bit pattern of two f32 lanes
        (`0x42DC0000` = 110.0 and `0x43700000` = 240.0) — both
        single-rounded FMA hardware and split mul+add produce
        the same result here since every input/product/sum is
        exactly representable in f32.
    Wired into `tests/unit/CMakeLists.txt` next to the parallel
    `exception-handling` test target. Gated on
    `WAMR_BUILD_RELAXED_SIMD=1 + WAMR_BUILD_FAST_INTERP=1`.

  * **Hand-rolled `q15mulr_s` swap → SIMDe intrinsic**: the patch-2
    case body for `SIMD_i16x8_relaxed_q15mulr_s` previously had a
    lane-by-lane fallback loop (because SIMDe doesn't ship a
    `relaxed_q15mulr_s` intrinsic). SIMDe DOES ship the
    non-relaxed `simde_wasm_i16x8_q15mulr_sat` (strict-saturating
    `sqrdmulh.8h` on aarch64), and the relaxed spec explicitly
    permits saturating behaviour. Swap to that: smaller code,
    NEON hardware path, bit-identical to the hand loop on the
    INT16_MIN² overflow boundary (verified locally via
    `q15mulr_overflow` test case: both produce 0x7ffe7fff7fff).

  * Docs nit: comment in patch-2 `HANDLE_OP(WASM_OP_SIMD_PREFIX)`
    referenced `emit_uint16(opcode1)` but the actual call is
    `wasm_loader_emit_int16(opcode1)`. Fixed.

Audit items verified OK without code change:
  - `clang-format-14` clean across all 5 commits.
  - `-Wpedantic` not enabled in `build-scripts/warnings.cmake` so
    the `({ })` GCC statement-expression in the V128 conversion
    macros is fine.
  - IR encoding's 2-byte sub-opcode advance via
    `wasm_loader_emit_int16` is safe on platforms without
    unaligned access (STORE_U16 with alignment asserts; legacy
    `emit_byte` also consumed 2 bytes there via padding).
  - `WASM_ENABLE_SIMDE` is always set when SIMD+FAST_INTERP are
    set, so the nested `#include "simde/wasm/relaxed-simd.h"`
    can't be reached without SIMDe being in scope.
  - `AOT_CURRENT_VERSION` correctly not bumped — no AOT struct
    changed.

References: WAMR PR bytecodealliance#4713 (woodsmc) made tests mandatory in
CONTRIBUTING.md; `@lum1n0us`'s PR bytecodealliance#4837 review pattern on
fast-interp EH ("follow `tests/unit/interpreter`") shapes the
new `tests/unit/relaxed-simd/` layout. CODEOWNERS will route
review to `@loganek @lum1n0us @no1wudi @TianlongLiang @yamt`.
…diate

Reviewer note (chatgpt-codex-connector on #3): summing all four i8 byte
products directly into the i32 lane skipped the i16 truncation point
that the spec defines via i16x8.relaxed_dot + extadd_pairwise_i16x8_s.

For lanes with a=b=0x80, the previous impl produced 65536+c, which is
outside the spec-allowed result set {-65536+c, 65534+c, -1+c} (wrap or
saturate at each of two pair sums). Fix preserves the i16 intermediate
using wrap, matching the i16x8 dot case immediately above.

Worked example, a=b=0x80 in all four bytes of one i32 lane:
  lo_pair = (-128*-128) + (-128*-128) = 32768
  (int16)32768           = -32768  (wrap)
  hi_pair = 32768 → -32768
  ext_sum = (i32)-32768 + (i32)-32768 = -65536
  result  = -65536 + c   ✓ wrap+wrap allowed value
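
The same arithmetic as a per-lane sketch, with the naive direct sum next to it for contrast. Function names are hypothetical and this is not the code in `wasm_interp_fast.c`. The narrowing casts rely on two's-complement wrap, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <stdint.h>

/* One i32 lane of the dot-add op, wrap variant. `a` and `b` point at the
 * four signed bytes that feed this lane; `c` is the accumulator lane. */
static int32_t
dot_add_lane_wrap(const int8_t a[4], const int8_t b[4], int32_t c)
{
    int32_t lo_sum = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    int32_t hi_sum = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
    /* Truncate each pair sum to i16 first, as the i16x8 dot step would. */
    int16_t lo = (int16_t)(uint16_t)lo_sum;
    int16_t hi = (int16_t)(uint16_t)hi_sum;
    int32_t ext_sum = (int32_t)lo + (int32_t)hi;
    /* Unsigned add avoids signed-overflow UB when adding the accumulator. */
    return (int32_t)((uint32_t)ext_sum + (uint32_t)c);
}

/* The rejected shape: sums all four products directly in i32 and never
 * passes through an i16 intermediate. */
static int32_t
dot_add_lane_naive(const int8_t a[4], const int8_t b[4], int32_t c)
{
    int32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * b[i];
    return (int32_t)((uint32_t)sum + (uint32_t)c);
}
```

With all eight bytes set to 0x80 and c = 0, the wrap variant returns -65536 and the naive variant returns 65536. For inputs whose pair sums fit in i16 the two agree, which is why in-range assertions cannot tell them apart.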

Two new tests for the chatgpt-codex-connector finding on #3:

  1. `dot_add_i16_intermediate_overflow_regression` — pins the
     spec-conformant -65536 result for the input pattern that
     used to produce 65536 (outside the spec-allowed set
     {-65536, -1, 65534}). Future refactor back to a direct-i32-
     sum impl fails immediately.

  2. `dot_s_i16_overflow_pin_sibling_op` — pins the sibling
     `i16x8.relaxed_dot_i8x16_i7x16_s` impl at the same overflow
     boundary. The current impl correctly truncates via the
     `(int16)sum` cast (wasm_interp_fast.c:8103); the test makes a
     future refactor that drops the cast loudly fail.

Both inputs use a = b = 0x80 in all 16 bytes — the classic case
where the i8×i8 pair sum overflows i16 and the truncation point
between "i16x8 relaxed dot" and "extadd_pairwise_i16x8_s"
distinguishes spec-conformant impls from naive direct-sum impls.

Bytecode for both modules was generated via
`wat2wasm --enable-relaxed-simd` on minimal known-good WAT
(documented inline in the static-array comments) and inlined to
avoid a wabt/wat-runtime dependency at test time.

The Coding Guidelines CI check uses `clang-format-14` and flagged
the line break I chose in the previous "preserve i16 intermediate"
commit. Newer clang-format-22 happens to accept both shapes;
clang-format-14 prefers the cast-then-paren-group form:

    result.i32x4[lane] =
        (int32)((uint32)ext_sum
                + (uint32)v3.i32x4[lane]);

Functionally identical. No behaviour change.

Two more relaxed-SIMD boundary tests in the unit suite. Both
exercise implementation-defined behaviour of the kind the
dot-product regression tests already cover in this PR, for two
ops that had no such coverage yet:

  1. `q15mulr_int16_min_squared_either_sat_or_wrap`: the
     INT16_MIN * INT16_MIN case. Spec relaxes the result of
     `sat_s((a*b + 0x4000) >> 15)` so an implementation may pick
     either the saturated value (0x7fff, as aarch64 SQRDMULH
     returns) or the wrapped value (0x8000, as x86 PMULHRSW
     returns). Test uses *membership* (either of the two
     allowed values) rather than exact equality, so a future
     switch to wrap doesn't break the test.

  2. `madd_inf_times_zero_propagates_nan` — adversarial input for
     the fused/unfused FMA path (`f32x4.relaxed_madd`). IEEE 754
     §7.2 makes `Inf * 0` an invalid multiply that produces NaN
     regardless of the subsequent add, so both `fma(Inf, 0, c)`
     and unfused `Inf * 0 + c` produce *some* NaN — but the
     specific NaN bit pattern is impl-defined. Test checks each
     lane against the IEEE-754 NaN predicate (exp == 0xff and
     fraction != 0) rather than an exact bit pattern.

Locally exercised via `iwasm -f`:
  q15mulr result: 0x7fff (saturate, current SIMDe lowering)
  madd_inf_times_zero result: 0x7fc00000 per lane (canonical f32 NaN)

Both fit the spec-allowed sets the tests describe; the membership
assertions confirm without overfitting to the specific bit
pattern.
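
A rough sketch of the two predicates these tests rely on, plus a scalar saturating q15mulr for reference. All names are hypothetical and none of this is the gtest code. The right shift of a negative `int32_t` is assumed to be arithmetic, as on mainstream compilers.

```c
#include <stdint.h>

/* Scalar saturating Q15 rounding multiply for one lane. Only
 * INT16_MIN * INT16_MIN exceeds the i16 range, so one clamp suffices. */
static int16_t
q15mulr_sat_lane(int16_t a, int16_t b)
{
    int32_t r = ((int32_t)a * (int32_t)b + 0x4000) >> 15;
    if (r > INT16_MAX)
        r = INT16_MAX;
    return (int16_t)r;
}

/* Membership check for the INT16_MIN squared case: saturate or wrap. */
static int
q15mulr_overflow_result_allowed(uint16_t bits)
{
    return bits == 0x7fff || bits == 0x8000;
}

/* f32 NaN predicate on the raw bits: exponent all ones, fraction nonzero. */
static int
f32_bits_is_nan(uint32_t bits)
{
    return ((bits >> 23) & 0xff) == 0xff && (bits & 0x7fffff) != 0;
}
```

The NaN predicate accepts the canonical `0x7fc00000` as well as any other payload, and rejects infinity (`0x7f800000`), so it does not depend on which NaN the FMA or mul+add path produces.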